CMPINF 2120: Homework 08¶
Hua Ji¶
Assigned: Thursday of Week 08 at 11:00PM¶
DUE: Thursday of Week 09 at 11:59PM¶
This assignment is focused on fitting logistic regression models for binary classification. You will work with a single data set that consists of mixed inputs and a binary outcome. The input names start with the letter x and the output is named y. You do not need to worry about preprocessing the continuous inputs for this assignment; you can focus on the programming details of fitting logistic regression models with statsmodels.
You will begin by exploring the data using figures appropriate for binary classification tasks. You will then fit multiple models, ranging from very simple to quite complex. You will use information criterion metrics to identify the best models. You will interpret the best models by examining coefficients and visualizing predictions of the event probability.
You may add as many code and markdown cells as you see fit to answer the questions.
Collaborators¶
Type the names of other students you worked with here.
Problem 00¶
Let's import all required modules and functions for this assignment upfront.
0a)¶
Import the "big 4" data analysis modules of NumPy, Pandas, matplotlib pyplot, and Seaborn. You must use their common aliases.
0a) - SOLUTION¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from patsy import dmatrices
Problem 01¶
You will work with a single data set for this assignment. This is a synthetic data set designed to represent common relationships you may encounter when training binary classifiers. The inputs themselves are not "challenging", so that you may focus on the relationships captured via features derived from the inputs in this application.
That said, it is essential to always explore data before training models! This problem requires you to visually explore the data using figures appropriate for binary classification tasks. However, the problem is open ended. There are no prompts for the specific types of figures you must make! You must decide which figures are critical for visually exploring the output-to-input relationships associated with binary classification problems.
HINT: The Week 08 lecture recordings help!
The synthetic data file name is assigned to the probA_path object in the cell below. The file name assumes that you have downloaded the hw08_probA.csv file into the same working directory as this notebook.
probA_path = 'hw08_probA.csv'
Read in the data using the appropriate Pandas function and assign the result to the dfA object.
Visually explore the inputs, output, and the input-output relationships using figures appropriate for binary classification problems.
You may add as many cells as you feel are necessary.
1) - SOLUTION¶
# read in data
dfA = pd.read_csv(probA_path)
dfA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555 entries, 0 to 554
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x1      555 non-null    float64
 1   x2      555 non-null    float64
 2   x3      555 non-null    object
 3   y       555 non-null    int64
dtypes: float64(2), int64(1), object(1)
memory usage: 17.5+ KB
dfA.nunique()
x1    555
x2    555
x3      5
y       2
dtype: int64
dfA.isna().sum()
x1    0
x2    0
x3    0
y     0
dtype: int64
sns.catplot(dfA, x='y', kind='count', height=4)
plt.show()
dfA.y.value_counts(normalize=True)
y
1    0.526126
0    0.473874
Name: proportion, dtype: float64
sns.catplot(dfA, x='x3', kind='count')
plt.show()
We can see the output classes are roughly balanced: about 53% events (y = 1) versus 47% non-events (y = 0).
sns.catplot(dfA, kind='box')
plt.show()
sns.displot(dfA, x='x1', kind='hist')
plt.show()
sns.displot(dfA, x='x2', kind='hist')
plt.show()
The boxplots show the continuous inputs have similar magnitudes, so standardization is not needed. The histograms show x1 and x2 are also nearly symmetric, so no transformation is required either.
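As a quick numeric check on the symmetry claim, we can compute the sample skewness of the two continuous inputs (a small sketch; values near zero support skipping a transformation).

# sample skewness of the continuous inputs; values near 0 indicate symmetry
dfA[['x1', 'x2']].skew()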
# explore the relationship between x1 and x2.
sns.relplot(dfA, x='x1',y='x2', hue='x3')
plt.show()
sns.relplot(dfA, x='x1', y='x2', col='x3',col_wrap=2, hue='y')
plt.show()
sns.lmplot(dfA, x='x1', y='y', logistic=True)
plt.show()
sns.lmplot(dfA, x='x2', y='y', logistic=True)
plt.show()
Problem 02¶
Now that you have explored the data, it is time to consider the types of features that you will include in your logistic regression models. The examples in the lecture recordings jumped straight to fitting the models, but you will first generate the design matrices (input feature arrays) before fitting the models. You will therefore know ahead of time how many features will be included in each model!
2a)¶
You will consider many models in this assignment, ranging from simple to complex. You will create the output and input feature arrays for each model using dmatrices() to identify the number of features associated with each model. You will fit 11 models following the guidelines discussed in the lecture recordings. You must therefore generate 11 sets of output and input feature arrays in this problem. All output and input feature arrays must be returned as Pandas DataFrames (thus array is a misnomer relative to the data type). All objects must use the following naming convention:
- The output feature array must be named yA_df_XX where XX is the model ID designation
- The input feature array must be named XA_df_XX where XX is the model ID designation
The 11 models will include different features derived from the inputs. The 11 models you will consider are:
- model 00: INTERCEPT-ONLY model
- model 01: Only categorical input
- model 02: Only continuous inputs - linear additive features
- model 03: All inputs - linear additive features
- model 04: Interact the categorical input with linear additive features from the continuous inputs
- model 05: Add the categorical input to main effect and pairwise products (interactions) between continuous inputs
- model 06: Interact the categorical input with main effect and pairwise products between continuous inputs
- model 07: Add the categorical input to linear and quadratic features derived from all continuous inputs
- model 08: Interact the categorical input with linear and quadratic features derived from all continuous inputs
- model 09: Interact the categorical input with main effects, pairwise products, and quadratic features derived from all continuous inputs
- model 10: Interact linear and quadratic features between the continuous inputs and interact all continuous features with the categorical input
IMPORTANT: Make sure the main effects are ALSO included in models with interactions!
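If you are ever unsure whether a formula includes the main effects, a tiny patsy check on a hypothetical two-row frame (demo, u, and v are made-up names for illustration) confirms that the * operator expands to the main effects plus the interaction.

# minimal check on a hypothetical frame: 'u*v' expands to u, v, and u:v
demo = pd.DataFrame({'y': [0, 1], 'u': [1.0, 2.0], 'v': [3.0, 4.0]})
_, X_demo = dmatrices('y ~ u*v', data=demo, return_type='dataframe')
X_demo.columns.tolist()  # ['Intercept', 'u', 'v', 'u:v']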
You may add as many cells as you feel are necessary.
2a) - SOLUTION¶
# INTERCEPT-ONLY model
yA_df_00, XA_df_00 =dmatrices('y~1', data=dfA, return_type="dataframe")
yA_df_00
| y | |
|---|---|
| 0 | 0.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 1.0 |
| 4 | 1.0 |
| ... | ... |
| 550 | 1.0 |
| 551 | 1.0 |
| 552 | 1.0 |
| 553 | 1.0 |
| 554 | 1.0 |
555 rows × 1 columns
XA_df_00
| Intercept | |
|---|---|
| 0 | 1.0 |
| 1 | 1.0 |
| 2 | 1.0 |
| 3 | 1.0 |
| 4 | 1.0 |
| ... | ... |
| 550 | 1.0 |
| 551 | 1.0 |
| 552 | 1.0 |
| 553 | 1.0 |
| 554 | 1.0 |
555 rows × 1 columns
# Only categorical input
yA_df_01,XA_df_01=dmatrices('y~x3', data=dfA, return_type="dataframe")
XA_df_01
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
555 rows × 5 columns
# Only continuous inputs - linear additive features
yA_df_02,XA_df_02=dmatrices('y~x1+x2', data=dfA, return_type='dataframe')
XA_df_02
| Intercept | x1 | x2 | |
|---|---|---|---|
| 0 | 1.0 | -0.120768 | -0.305824 |
| 1 | 1.0 | 1.042928 | -2.007914 |
| 2 | 1.0 | -2.639345 | -0.221434 |
| 3 | 1.0 | 0.542346 | 2.671225 |
| 4 | 1.0 | 3.124682 | 0.956447 |
| ... | ... | ... | ... |
| 550 | 1.0 | -1.337680 | -0.441454 |
| 551 | 1.0 | -1.848733 | 1.113199 |
| 552 | 1.0 | -0.217562 | 0.432311 |
| 553 | 1.0 | -2.698854 | -0.852384 |
| 554 | 1.0 | -0.542255 | 2.512838 |
555 rows × 3 columns
# All inputs - linear additive features
yA_df_03,XA_df_03=dmatrices('y~x1+x2 +x3', data=dfA, return_type='dataframe')
XA_df_03
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | x1 | x2 | |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.120768 | -0.305824 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.042928 | -2.007914 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.639345 | -0.221434 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542346 | 2.671225 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.124682 | 0.956447 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.337680 | -0.441454 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.848733 | 1.113199 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.217562 | 0.432311 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.698854 | -0.852384 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.542255 | 2.512838 |
555 rows × 7 columns
# Interact the categorical input with linear additive features from the continuous inputs
yA_df_04, XA_df_04=dmatrices('y~x3*(x1+x2)',data=dfA, return_type="dataframe")
XA_df_04
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | x1 | x3[T.B]:x1 | x3[T.C]:x1 | x3[T.D]:x1 | x3[T.E]:x1 | x2 | x3[T.B]:x2 | x3[T.C]:x2 | x3[T.D]:x2 | x3[T.E]:x2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.120768 | -0.0 | -0.0 | -0.0 | -0.000000 | -0.305824 | -0.0 | -0.0 | -0.0 | -0.000000 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.042928 | 0.0 | 0.0 | 0.0 | 0.000000 | -2.007914 | -0.0 | -0.0 | -0.0 | -0.000000 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.639345 | -0.0 | -0.0 | -0.0 | -0.000000 | -0.221434 | -0.0 | -0.0 | -0.0 | -0.000000 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542346 | 0.0 | 0.0 | 0.0 | 0.000000 | 2.671225 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.124682 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.956447 | 0.0 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.337680 | -0.0 | -0.0 | -0.0 | -1.337680 | -0.441454 | -0.0 | -0.0 | -0.0 | -0.441454 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.848733 | -0.0 | -0.0 | -0.0 | -1.848733 | 1.113199 | 0.0 | 0.0 | 0.0 | 1.113199 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.217562 | -0.0 | -0.0 | -0.0 | -0.217562 | 0.432311 | 0.0 | 0.0 | 0.0 | 0.432311 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.698854 | -0.0 | -0.0 | -0.0 | -2.698854 | -0.852384 | -0.0 | -0.0 | -0.0 | -0.852384 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.542255 | -0.0 | -0.0 | -0.0 | -0.542255 | 2.512838 | 0.0 | 0.0 | 0.0 | 2.512838 |
555 rows × 15 columns
# Add the categorical input to main effect and pairwise products (interactions) between continuous inputs
yA_df_05, XA_df_05=dmatrices('y~x3+(x1+x2)**2',data=dfA, return_type='dataframe')
XA_df_05
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | x1 | x2 | x1:x2 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.120768 | -0.305824 | 0.036934 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.042928 | -2.007914 | -2.094110 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.639345 | -0.221434 | 0.584441 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542346 | 2.671225 | 1.448729 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.124682 | 0.956447 | 2.988591 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.337680 | -0.441454 | 0.590524 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.848733 | 1.113199 | -2.058007 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.217562 | 0.432311 | -0.094055 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.698854 | -0.852384 | 2.300459 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.542255 | 2.512838 | -1.362600 |
555 rows × 8 columns
# Interact the categorical input with main effect and pairwise products between continuous inputs
yA_df_06, XA_df_06=dmatrices('y~x3*((x1+x2)**2)',data=dfA, return_type="dataframe")
XA_df_06
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | x1 | x3[T.B]:x1 | x3[T.C]:x1 | x3[T.D]:x1 | x3[T.E]:x1 | x2 | x3[T.B]:x2 | x3[T.C]:x2 | x3[T.D]:x2 | x3[T.E]:x2 | x1:x2 | x3[T.B]:x1:x2 | x3[T.C]:x1:x2 | x3[T.D]:x1:x2 | x3[T.E]:x1:x2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.120768 | -0.0 | -0.0 | -0.0 | -0.000000 | -0.305824 | -0.0 | -0.0 | -0.0 | -0.000000 | 0.036934 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.042928 | 0.0 | 0.0 | 0.0 | 0.000000 | -2.007914 | -0.0 | -0.0 | -0.0 | -0.000000 | -2.094110 | -0.0 | -0.0 | -0.0 | -0.000000 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.639345 | -0.0 | -0.0 | -0.0 | -0.000000 | -0.221434 | -0.0 | -0.0 | -0.0 | -0.000000 | 0.584441 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542346 | 0.0 | 0.0 | 0.0 | 0.000000 | 2.671225 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.448729 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.124682 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.956447 | 0.0 | 0.0 | 0.0 | 0.000000 | 2.988591 | 0.0 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.337680 | -0.0 | -0.0 | -0.0 | -1.337680 | -0.441454 | -0.0 | -0.0 | -0.0 | -0.441454 | 0.590524 | 0.0 | 0.0 | 0.0 | 0.590524 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.848733 | -0.0 | -0.0 | -0.0 | -1.848733 | 1.113199 | 0.0 | 0.0 | 0.0 | 1.113199 | -2.058007 | -0.0 | -0.0 | -0.0 | -2.058007 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.217562 | -0.0 | -0.0 | -0.0 | -0.217562 | 0.432311 | 0.0 | 0.0 | 0.0 | 0.432311 | -0.094055 | -0.0 | -0.0 | -0.0 | -0.094055 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.698854 | -0.0 | -0.0 | -0.0 | -2.698854 | -0.852384 | -0.0 | -0.0 | -0.0 | -0.852384 | 2.300459 | 0.0 | 0.0 | 0.0 | 2.300459 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.542255 | -0.0 | -0.0 | -0.0 | -0.542255 | 2.512838 | 0.0 | 0.0 | 0.0 | 2.512838 | -1.362600 | -0.0 | -0.0 | -0.0 | -1.362600 |
555 rows × 20 columns
# Add the categorical input to linear and quadratic features derived from all continuous inputs
yA_df_07,XA_df_07=dmatrices('y~x3+x1+x2+np.power(x1,2)+np.power(x2,2)', data=dfA, return_type='dataframe')
XA_df_07
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | x1 | x2 | np.power(x1, 2) | np.power(x2, 2) | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.120768 | -0.305824 | 0.014585 | 0.093528 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.042928 | -2.007914 | 1.087699 | 4.031717 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.639345 | -0.221434 | 6.966141 | 0.049033 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542346 | 2.671225 | 0.294140 | 7.135445 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.124682 | 0.956447 | 9.763638 | 0.914790 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.337680 | -0.441454 | 1.789388 | 0.194881 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.848733 | 1.113199 | 3.417813 | 1.239211 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.217562 | 0.432311 | 0.047333 | 0.186893 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.698854 | -0.852384 | 7.283813 | 0.726558 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.542255 | 2.512838 | 0.294041 | 6.314356 |
555 rows × 9 columns
# Interact the categorical input with linear and quadratic features derived from all continuous inputs
yA_df_08,XA_df_08=dmatrices('y~x3*(x1+x2+np.power(x1,2)+np.power(x2,2))',data=dfA,return_type="dataframe")
XA_df_08
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | x1 | x3[T.B]:x1 | x3[T.C]:x1 | x3[T.D]:x1 | x3[T.E]:x1 | ... | np.power(x1, 2) | x3[T.B]:np.power(x1, 2) | x3[T.C]:np.power(x1, 2) | x3[T.D]:np.power(x1, 2) | x3[T.E]:np.power(x1, 2) | np.power(x2, 2) | x3[T.B]:np.power(x2, 2) | x3[T.C]:np.power(x2, 2) | x3[T.D]:np.power(x2, 2) | x3[T.E]:np.power(x2, 2) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.120768 | -0.0 | -0.0 | -0.0 | -0.000000 | ... | 0.014585 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.093528 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.042928 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 1.087699 | 0.0 | 0.0 | 0.0 | 0.000000 | 4.031717 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.639345 | -0.0 | -0.0 | -0.0 | -0.000000 | ... | 6.966141 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.049033 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542346 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.294140 | 0.0 | 0.0 | 0.0 | 0.000000 | 7.135445 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.124682 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 9.763638 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.914790 | 0.0 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.337680 | -0.0 | -0.0 | -0.0 | -1.337680 | ... | 1.789388 | 0.0 | 0.0 | 0.0 | 1.789388 | 0.194881 | 0.0 | 0.0 | 0.0 | 0.194881 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.848733 | -0.0 | -0.0 | -0.0 | -1.848733 | ... | 3.417813 | 0.0 | 0.0 | 0.0 | 3.417813 | 1.239211 | 0.0 | 0.0 | 0.0 | 1.239211 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.217562 | -0.0 | -0.0 | -0.0 | -0.217562 | ... | 0.047333 | 0.0 | 0.0 | 0.0 | 0.047333 | 0.186893 | 0.0 | 0.0 | 0.0 | 0.186893 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.698854 | -0.0 | -0.0 | -0.0 | -2.698854 | ... | 7.283813 | 0.0 | 0.0 | 0.0 | 7.283813 | 0.726558 | 0.0 | 0.0 | 0.0 | 0.726558 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.542255 | -0.0 | -0.0 | -0.0 | -0.542255 | ... | 0.294041 | 0.0 | 0.0 | 0.0 | 0.294041 | 6.314356 | 0.0 | 0.0 | 0.0 | 6.314356 |
555 rows × 25 columns
# Interact the categorical input with main effects, pairwise products, and quadratic features derived from all continuous inputs
yA_df_09,XA_df_09=dmatrices('y~x3*((x1+x2)**2+np.power(x1,2)+np.power(x2,2))',data=dfA, return_type="dataframe")
XA_df_09
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | x1 | x3[T.B]:x1 | x3[T.C]:x1 | x3[T.D]:x1 | x3[T.E]:x1 | ... | np.power(x1, 2) | x3[T.B]:np.power(x1, 2) | x3[T.C]:np.power(x1, 2) | x3[T.D]:np.power(x1, 2) | x3[T.E]:np.power(x1, 2) | np.power(x2, 2) | x3[T.B]:np.power(x2, 2) | x3[T.C]:np.power(x2, 2) | x3[T.D]:np.power(x2, 2) | x3[T.E]:np.power(x2, 2) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.120768 | -0.0 | -0.0 | -0.0 | -0.000000 | ... | 0.014585 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.093528 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.042928 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 1.087699 | 0.0 | 0.0 | 0.0 | 0.000000 | 4.031717 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.639345 | -0.0 | -0.0 | -0.0 | -0.000000 | ... | 6.966141 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.049033 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542346 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.294140 | 0.0 | 0.0 | 0.0 | 0.000000 | 7.135445 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.124682 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 9.763638 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.914790 | 0.0 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.337680 | -0.0 | -0.0 | -0.0 | -1.337680 | ... | 1.789388 | 0.0 | 0.0 | 0.0 | 1.789388 | 0.194881 | 0.0 | 0.0 | 0.0 | 0.194881 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.848733 | -0.0 | -0.0 | -0.0 | -1.848733 | ... | 3.417813 | 0.0 | 0.0 | 0.0 | 3.417813 | 1.239211 | 0.0 | 0.0 | 0.0 | 1.239211 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.217562 | -0.0 | -0.0 | -0.0 | -0.217562 | ... | 0.047333 | 0.0 | 0.0 | 0.0 | 0.047333 | 0.186893 | 0.0 | 0.0 | 0.0 | 0.186893 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.698854 | -0.0 | -0.0 | -0.0 | -2.698854 | ... | 7.283813 | 0.0 | 0.0 | 0.0 | 7.283813 | 0.726558 | 0.0 | 0.0 | 0.0 | 0.726558 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.542255 | -0.0 | -0.0 | -0.0 | -0.542255 | ... | 0.294041 | 0.0 | 0.0 | 0.0 | 0.294041 | 6.314356 | 0.0 | 0.0 | 0.0 | 6.314356 |
555 rows × 30 columns
# Interact linear and quadratic features between the continuous inputs and interact all continuous features with the categorical input
yA_df_10,XA_df_10=dmatrices('y~((x1+np.power(x1,2))*(x2+np.power(x2,2)))*x3', data=dfA, return_type="dataframe")
XA_df_10
| Intercept | x3[T.B] | x3[T.C] | x3[T.D] | x3[T.E] | x1 | x1:x3[T.B] | x1:x3[T.C] | x1:x3[T.D] | x1:x3[T.E] | ... | np.power(x1, 2):x2 | np.power(x1, 2):x2:x3[T.B] | np.power(x1, 2):x2:x3[T.C] | np.power(x1, 2):x2:x3[T.D] | np.power(x1, 2):x2:x3[T.E] | np.power(x1, 2):np.power(x2, 2) | np.power(x1, 2):np.power(x2, 2):x3[T.B] | np.power(x1, 2):np.power(x2, 2):x3[T.C] | np.power(x1, 2):np.power(x2, 2):x3[T.D] | np.power(x1, 2):np.power(x2, 2):x3[T.E] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.120768 | -0.0 | -0.0 | -0.0 | -0.000000 | ... | -0.004460 | -0.0 | -0.0 | -0.0 | -0.000000 | 0.001364 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.042928 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | -2.184006 | -0.0 | -0.0 | -0.0 | -0.000000 | 4.385296 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.639345 | -0.0 | -0.0 | -0.0 | -0.000000 | ... | -1.542542 | -0.0 | -0.0 | -0.0 | -0.000000 | 0.341572 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542346 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.785713 | 0.0 | 0.0 | 0.0 | 0.000000 | 2.098817 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.124682 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 9.338398 | 0.0 | 0.0 | 0.0 | 0.000000 | 8.931679 | 0.0 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.337680 | -0.0 | -0.0 | -0.0 | -1.337680 | ... | -0.789932 | -0.0 | -0.0 | -0.0 | -0.789932 | 0.348718 | 0.0 | 0.0 | 0.0 | 0.348718 |
| 551 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.848733 | -0.0 | -0.0 | -0.0 | -1.848733 | ... | 3.804705 | 0.0 | 0.0 | 0.0 | 3.804705 | 4.235393 | 0.0 | 0.0 | 0.0 | 4.235393 |
| 552 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.217562 | -0.0 | -0.0 | -0.0 | -0.217562 | ... | 0.020463 | 0.0 | 0.0 | 0.0 | 0.020463 | 0.008846 | 0.0 | 0.0 | 0.0 | 0.008846 |
| 553 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.698854 | -0.0 | -0.0 | -0.0 | -2.698854 | ... | -6.208603 | -0.0 | -0.0 | -0.0 | -6.208603 | 5.292111 | 0.0 | 0.0 | 0.0 | 5.292111 |
| 554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.542255 | -0.0 | -0.0 | -0.0 | -0.542255 | ... | 0.738877 | 0.0 | 0.0 | 0.0 | 0.738877 | 1.856678 | 0.0 | 0.0 | 0.0 | 1.856678 |
555 rows × 45 columns
X_matrices = [XA_df_00, XA_df_01, XA_df_02, XA_df_03, XA_df_04, XA_df_05, XA_df_06, XA_df_07, XA_df_08, XA_df_09, XA_df_10]
[X.shape[1] for X in X_matrices]
[1, 5, 3, 7, 15, 8, 20, 9, 25, 30, 45]
The intercept-only model (XA_df_00) has the FEWEST features and XA_df_10 has the MOST features.
Equivalently, the intercept-only model has the FEWEST regression coefficients to estimate, and model 10 (the 11th model) has the MOST coefficients to estimate.
2d)¶
This week's lecture examples concluded by showing how information criterion metrics like AIC and BIC can be used to identify the BEST model on the training set alone while guarding against overfitting. This is accomplished by penalizing the training set model performance.
Which of the 11 models will have the SMALLEST penalty according to AIC?
Which of the 11 models will have the GREATEST penalty according to AIC?
Type your response in a Markdown cell below.
2d) - SOLUTION¶
Since $$\mathrm{AIC} = 2 \times k - 2\times \log\left( \mathrm{Likelihood} \right)$$ where $k$ is the number of coefficients, the more features in the model, the more unknowns, and the greater the penalty. Model 00 (intercept-only) has the fewest features, so it will have the SMALLEST penalty, and model 10 has the most features, so it will have the GREATEST penalty according to AIC.
2e)¶
This week's lecture examples concluded by showing how information criterion metrics like AIC and BIC can be used to identify the BEST model on the training set alone while guarding against overfitting. This is accomplished by penalizing the training set model performance.
Which of the 11 models will have the SMALLEST penalty according to BIC?
Which of the 11 models will have the GREATEST penalty according to BIC?
Type your response in a Markdown cell below.
2e) - SOLUTION¶
$$\mathrm{BIC} = k \times \log\left( N \right) - 2\times \log\left( \mathrm{Likelihood} \right)$$ where $N$ is the number of observations. The BIC penalty therefore depends on both the number of unknowns and the number of observations. All 11 models are fit to the same $N = 555$ observations, so the penalty ranking depends only on the number of coefficients: model 00 has the SMALLEST penalty and model 10 has the GREATEST penalty according to BIC.
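A small sketch makes the two penalties concrete, using the feature counts found in 2a) and the common N = 555.

# AIC penalty is 2*k per model; BIC penalty is k*log(N), and log(555) ~ 6.32 > 2
feature_counts = [1, 5, 3, 7, 15, 8, 20, 9, 25, 30, 45]
pd.DataFrame({'num_coef': feature_counts,
              'AIC_penalty': [2 * k for k in feature_counts],
              'BIC_penalty': [k * np.log(555) for k in feature_counts]})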
Problem 03¶
It is now time to fit the 11 logistic regression models! You must fit all models using the statsmodels formula interface. After fitting the models you will identify the best models using AIC and BIC.
3a)¶
The fitted models must be assigned to objects that follow a specific naming convention. The model objects must be named as fitA_XX where XX is the model ID designation. Each model uses a different set of features and thus a different formula. The formulas were described in the previous problem, but they are repeated below for your convenience.
- model 00: INTERCEPT-ONLY model
- model 01: Only categorical input
- model 02: Only continuous inputs - linear additive features
- model 03: All inputs - linear additive features
- model 04: Interact the categorical input with linear additive features from the continuous inputs
- model 05: Add the categorical input to main effect and pairwise products (interactions) between continuous inputs
- model 06: Interact the categorical input with main effect and pairwise products between continuous inputs
- model 07: Add the categorical input to linear and quadratic features derived from all continuous inputs
- model 08: Interact the categorical input with linear and quadratic features derived from all continuous inputs
- model 09: Interact the categorical input with main effects, pairwise products, and quadratic features derived from all continuous inputs
- model 10: Interact linear and quadratic features between the continuous inputs and interact all continuous features with the categorical input
Fit all 11 models and assign the fitted model objects to the appropriate object names.
You may add as many cells as you feel are necessary.
3a) - SOLUTION¶
fitA_00=smf.logit('y~1',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.691781
Iterations 3
fitA_00.params
Intercept    0.1046
dtype: float64
fitA_01=smf.logit(formula='y~x3',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.687414
Iterations 4
fitA_01.params
Intercept   -5.406722e-02
x3[T.B]     -5.036176e-17
x3[T.C]      5.123748e-01
x3[T.D]      1.803609e-01
x3[T.E]      1.081344e-01
dtype: float64
fitA_02=smf.logit(formula='y~x1+x2',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.663425
Iterations 5
fitA_03=smf.logit(formula='y~x1+x2+x3',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.656500
Iterations 5
fitA_04=smf.logit(formula='y~x3*(x1+x2)',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.569756
Iterations 6
fitA_05=smf.logit(formula='y~x3+(x1+x2)**2',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.655574
Iterations 5
fitA_06=smf.logit(formula='y~x3*((x1+x2)**2)',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.564722
Iterations 7
fitA_07=smf.logit(formula='y~x3+x1+x2+np.power(x1,2)+np.power(x2,2)',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.648063
Iterations 5
fitA_08=smf.logit(formula='y~x3*(x1+x2+np.power(x1,2)+np.power(x2,2))',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.531693
Iterations 7
fitA_09=smf.logit(formula='y~x3*((x1+x2)**2+np.power(x1,2)+np.power(x2,2))',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.525994
Iterations 7
fitA_10=smf.logit(formula='y~((x1+np.power(x1,2))*(x2+np.power(x2,2)))*x3',data=dfA).fit()
Optimization terminated successfully.
Current function value: 0.514823
Iterations 9
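As an aside, the eleven fits above could also be produced by looping over the formula strings; the sketch below is equivalent to the individual cells (disp=0 just suppresses the convergence messages).

# same eleven fits, built in a loop over the formulas from Problem 02
formulas = ['y~1', 'y~x3', 'y~x1+x2', 'y~x1+x2+x3',
            'y~x3*(x1+x2)', 'y~x3+(x1+x2)**2', 'y~x3*((x1+x2)**2)',
            'y~x3+x1+x2+np.power(x1,2)+np.power(x2,2)',
            'y~x3*(x1+x2+np.power(x1,2)+np.power(x2,2))',
            'y~x3*((x1+x2)**2+np.power(x1,2)+np.power(x2,2))',
            'y~((x1+np.power(x1,2))*(x2+np.power(x2,2)))*x3']
fits = [smf.logit(f, data=dfA).fit(disp=0) for f in formulas]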
model_list=[fitA_00,fitA_01,fitA_02,fitA_03,fitA_04,fitA_05,fitA_06,fitA_07,fitA_08,fitA_09,fitA_10]
model_results_df = pd.DataFrame({'model_name': ["fitA-%02d" %d for d in range(0,11)],
'num_coef': [model.params.size for model in model_list],
'LogLik': [model.llf for model in model_list],
'AIC': [model.aic for model in model_list],
'BIC': [model.bic for model in model_list]})
model_results_df.sort_values(by='AIC')
| model_name | num_coef | LogLik | AIC | BIC | |
|---|---|---|---|---|---|
| 8 | fitA-08 | 25 | -295.089801 | 640.179601 | 748.153804 |
| 9 | fitA-09 | 30 | -291.926924 | 643.853847 | 773.422891 |
| 10 | fitA-10 | 45 | -285.726813 | 661.453626 | 855.807191 |
| 4 | fitA-04 | 15 | -316.214501 | 662.429003 | 727.213525 |
| 6 | fitA-06 | 20 | -313.420726 | 666.841453 | 753.220815 |
| 7 | fitA-07 | 9 | -359.674991 | 737.349981 | 776.220694 |
| 2 | fitA-02 | 3 | -368.200684 | 742.401368 | 755.358272 |
| 3 | fitA-03 | 7 | -364.357366 | 742.714732 | 772.947509 |
| 5 | fitA-05 | 8 | -363.843567 | 743.687133 | 778.238878 |
| 0 | fitA-00 | 1 | -383.938682 | 769.877365 | 774.196333 |
| 1 | fitA-01 | 5 | -381.514568 | 773.029136 | 794.623976 |
According to AIC, model fitA_08 is the best model.
model_results_df.sort_values(by='BIC')
| model_name | num_coef | LogLik | AIC | BIC | |
|---|---|---|---|---|---|
| 4 | fitA-04 | 15 | -316.214501 | 662.429003 | 727.213525 |
| 8 | fitA-08 | 25 | -295.089801 | 640.179601 | 748.153804 |
| 6 | fitA-06 | 20 | -313.420726 | 666.841453 | 753.220815 |
| 2 | fitA-02 | 3 | -368.200684 | 742.401368 | 755.358272 |
| 3 | fitA-03 | 7 | -364.357366 | 742.714732 | 772.947509 |
| 9 | fitA-09 | 30 | -291.926924 | 643.853847 | 773.422891 |
| 0 | fitA-00 | 1 | -383.938682 | 769.877365 | 774.196333 |
| 7 | fitA-07 | 9 | -359.674991 | 737.349981 | 776.220694 |
| 5 | fitA-05 | 8 | -363.843567 | 743.687133 | 778.238878 |
| 1 | fitA-01 | 5 | -381.514568 | 773.029136 | 794.623976 |
| 10 | fitA-10 | 45 | -285.726813 | 661.453626 | 855.807191 |
According to BIC, fitA_04 is the best model.
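The two winners can also be confirmed programmatically (a quick sketch using idxmin on the metric columns).

# rows with the minimum AIC and BIC, respectively
print(model_results_df.loc[model_results_df.AIC.idxmin(), 'model_name'])  # fitA-08
print(model_results_df.loc[model_results_df.BIC.idxmin(), 'model_name'])  # fitA-04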
3d)¶
Let's interpret the best models to gain insights on how the inputs influence the event probability.
How many features are statistically significant for the AIC identified best model?
Which statistically significant feature has the highest MAGNITUDE coefficient for the AIC identified best model?
3d) - SOLUTION¶
def my_coefplot(mod, figsize_use=(10, 4)):
    # plot each coefficient estimate with an approximate +/- 2 standard error bar
    fig, ax = plt.subplots(figsize=figsize_use)
    ax.errorbar(y=mod.params.index,
                x=mod.params,
                xerr=2 * mod.bse,
                fmt='o', color='k', ecolor='k', elinewidth=2, ms=10)
    # coefficients whose error bars cross zero are not statistically significant
    ax.axvline(x=0, linestyle='--', color='grey', linewidth=3.5)
    ax.set_xlabel('coefficient value')
    plt.show()
my_coefplot(fitA_08)
Eight coefficients are statistically significant for the AIC identified best model. The x3[T.E]:x1 and x3[T.D]:x1 features have the highest MAGNITUDE coefficients for the AIC identified best model.
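To double-check the visual read of the coefficient plot, we can count the coefficients with p-values below 0.05 and rank the significant ones by magnitude (a sketch; the 0.05 threshold is an assumed convention).

# coefficients of the AIC-best model significant at the (assumed) 0.05 level,
# sorted by absolute magnitude
signif = fitA_08.pvalues[fitA_08.pvalues < 0.05]
print(signif.size, 'significant coefficients')
fitA_08.params[signif.index].abs().sort_values(ascending=False)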
my_coefplot(fitA_04)
Eight features are also statistically significant for the BIC identified best model. Features x3[T.E]:x1 and x3[T.D]:x1 have the highest MAGNITUDE coefficients for the BIC identified best model.
Problem 04¶
Let's conclude this assignment by visualizing predictions from the best models. You will predict the event probability for the AIC and BIC identified best models to aid the interpretations and conclusions drawn from examining the coefficients.
4a)¶
However, you cannot make predictions without first defining an appropriate visualization grid! All inputs must be included in the new data set. You cannot ignore or forget about any of the inputs! However, that does not mean all inputs must use the same number of unique values! This application involves 2 continuous inputs and 1 categorical input. The grid will consist of many unique values of the continuous inputs and each unique value of the categorical input. The grid will therefore consider combinations not observed in the training data!
Create a DataFrame, input_A_viz, that consists of three columns. The column names must be the same as the input names in the dfA DataFrame. The continuous inputs must include 101 evenly spaced values between their training set minimums and maximums. All categories associated with the categorical input must be included in the grid.
Display the number of unique values associated with each column in input_A_viz to confirm it is created successfully.
4a) - SOLUTION¶
input_A_viz=pd.DataFrame([(x1,x2,x3) for x1 in np.linspace(dfA.x1.min(),dfA.x1.max(), num=101)
for x2 in np.linspace(dfA.x2.min(),dfA.x2.max(), num=101)
for x3 in dfA.x3.unique()],
columns=['x1','x2','x3'])
input_A_viz.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51005 entries, 0 to 51004
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x1      51005 non-null  float64
 1   x2      51005 non-null  float64
 2   x3      51005 non-null  object
dtypes: float64(2), object(1)
memory usage: 1.2+ MB
input_A_viz.nunique()
x1    101
x2    101
x3      5
dtype: int64
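An equivalent way to build the same grid (a sketch) replaces the nested comprehension with itertools.product.

from itertools import product

# same 101 * 101 * 5 = 51005-row grid built with itertools.product
x1_grid = np.linspace(dfA.x1.min(), dfA.x1.max(), num=101)
x2_grid = np.linspace(dfA.x2.min(), dfA.x2.max(), num=101)
grid_alt = pd.DataFrame(list(product(x1_grid, x2_grid, dfA.x3.unique())),
                        columns=['x1', 'x2', 'x3'])
grid_alt.shape  # (51005, 3)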
4b)¶
Let's visualize the input grid before making predictions to confirm it has been setup correctly.
Create a scatter plot in Seaborn where the x semantic variable is associated with the x1 input and the y semantic variable is associated with the x2 input. The column facets must be associated with the x3 input.
4b) - SOLUTION¶
sns.relplot(data=input_A_viz,x='x1',y='x2',col='x3', col_wrap=2 )
plt.show()
4c)¶
You now have everything ready to make predictions!
Create a deep copy of input_A_viz and assign the copy to the df_A_viz object.
Create 2 new columns in df_A_viz. One of the columns must be named pred_prob_AIC and the other must be named pred_prob_BIC. Assign the predictions from the AIC identified best model to the pred_prob_AIC column. Assign the predictions from the BIC identified model to the pred_prob_BIC column.
Display the .head() of the df_A_viz object to the screen.
4c) - SOLUTION¶
df_A_viz=input_A_viz.copy()
df_A_viz['pred_prob_AIC']=fitA_08.predict(df_A_viz)
df_A_viz['pred_prob_BIC'] =fitA_04.predict(df_A_viz)
df_A_viz.head()
| x1 | x2 | x3 | pred_prob_AIC | pred_prob_BIC | |
|---|---|---|---|---|---|
| 0 | -5.270025 | -4.622544 | A | 0.002187 | 0.000134 |
| 1 | -5.270025 | -4.622544 | B | 0.105997 | 0.046565 |
| 2 | -5.270025 | -4.622544 | C | 1.000000 | 0.521364 |
| 3 | -5.270025 | -4.622544 | D | 0.798110 | 0.976944 |
| 4 | -5.270025 | -4.622544 | E | 0.018166 | 0.436589 |
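Before visualizing, a quick sanity check (a sketch) confirms that both sets of predictions are valid event probabilities.

# predicted probabilities must lie between 0 and 1
pred_cols = df_A_viz[['pred_prob_AIC', 'pred_prob_BIC']]
assert ((pred_cols >= 0) & (pred_cols <= 1)).all().all()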
4d)¶
You will now visualize the predictions using a scatter plot organized consistently with 4b). The difference though is that the marker color will be associated with a model predicted event probability!
Visualize the predictions on the input grid associated with the AIC identified best model by associating the hue semantic variable with pred_prob_AIC. You must use a diverging color palette in the Seaborn scatter plot.
4d) - SOLUTION¶
sns.relplot(data=df_A_viz, x='x1', y='x2', col='x3', hue='pred_prob_AIC', col_wrap=2, palette='coolwarm')  # coolwarm is a diverging palette, as required
plt.show()
sns.relplot(data=df_A_viz, x='x1', y='x2', col='x3', hue='pred_prob_BIC', col_wrap=2, palette='coolwarm')  # same diverging palette for the BIC-best model
plt.show()
The difference in the predicted probability surfaces between the AIC and BIC identified best models shows up mainly in the x3 = C facet: the AIC-best model places the highest event probabilities near the center of the input space, while the BIC-best model shows almost no high-probability region there.
The AIC and BIC selections differ because the AIC penalty depends only on the number of coefficients, while the BIC penalty depends on both the number of coefficients and the number of observations. With $N = 555$, $\log(N) \approx 6.32 > 2$, so BIC penalizes each additional coefficient more heavily and therefore favors the simpler model.
lf_viz=df_A_viz.reset_index().rename(columns={'index':'rowid'}).\
melt(id_vars=['rowid','x1','x2','x3' ], value_vars=['pred_prob_AIC', 'pred_prob_BIC'])
lf_viz
| rowid | x1 | x2 | x3 | variable | value | |
|---|---|---|---|---|---|---|
| 0 | 0 | -5.270025 | -4.622544 | A | pred_prob_AIC | 0.002187 |
| 1 | 1 | -5.270025 | -4.622544 | B | pred_prob_AIC | 0.105997 |
| 2 | 2 | -5.270025 | -4.622544 | C | pred_prob_AIC | 1.000000 |
| 3 | 3 | -5.270025 | -4.622544 | D | pred_prob_AIC | 0.798110 |
| 4 | 4 | -5.270025 | -4.622544 | E | pred_prob_AIC | 0.018166 |
| ... | ... | ... | ... | ... | ... | ... |
| 102005 | 51000 | 4.796841 | 4.687443 | A | pred_prob_BIC | 0.999768 |
| 102006 | 51001 | 4.796841 | 4.687443 | B | pred_prob_BIC | 0.944738 |
| 102007 | 51002 | 4.796841 | 4.687443 | C | pred_prob_BIC | 0.690248 |
| 102008 | 51003 | 4.796841 | 4.687443 | D | pred_prob_BIC | 0.030825 |
| 102009 | 51004 | 4.796841 | 4.687443 | E | pred_prob_BIC | 0.637584 |
102010 rows × 6 columns
sns.relplot(lf_viz, x='x1',y='x2',col='x3',row='variable',hue='value')
plt.show()